Building headers with regex replacement

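When building requests for a crawler, you inevitably have to add request headers (headers). This is usually done by adding the data from fields such as user-agent, but when there are many fields it becomes tedious.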

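Sample headers

Take the raw header text below as an example. Turning it into a dict by hand means adding quotes and commas line by line. Either of the following two methods builds the headers quickly.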

age: 21
cache-control: no-cache,no-store,private
cdn-ip: 2408:8752:300:6:0:1:2:11
cdn-source: baishan
cdn-user-ip: 2408:84ef:10:b3d0:c176:8470:c886:c23
content-encoding: gzip
content-type: text/html; charset=GBK
date: Mon, 16 Mar 2020 13:59:17 GMT
expires: Mon, 16 Mar 2020 14:00:16 GMT
server: nginx
status: 200
vary: Accept-Encoding
x-cache-remote: HIT
x-content-from: netease
x-ser: BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4

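Regex replacement in Sublime

Open Sublime, open the Replace panel with regular expression mode enabled, and enter the following in the Find field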

(.*?): (.*)

Replace with

"$1": "$2",

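Here $1 and $2 refer to the capture groups, that is, the original text matched by the first and second pair of parentheses (the header name and the header value)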
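To see what the two groups capture, the same find-and-replace can be reproduced in Python with re.sub. This is only a rough sketch: Python writes the backreferences as \1 and \2 rather than $1 and $2, and the two sample lines are taken from the headers above.

```python
import re

# Two sample lines from the headers above; the second has colons inside its value
sample = "age: 21\ncdn-ip: 2408:8752:300:6:0:1:2:11"

# Same pattern as in the Find field. The lazy (.*?) stops at the first ": ",
# so the later colons stay inside the value captured by the second group.
quoted = re.sub(r'(.*?): (.*)', r'"\1": "\2",', sample)
print(quoted)
```

Because the first group is lazy, values that themselves contain colons (IPv6 addresses, timestamps) are left intact.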

Regex replacement in Python

Of course, you can also write a short program to do the same thing

# Import the regex module
import re

# The raw header text
header = '''
age: 21
cache-control: no-cache,no-store,private
cdn-ip: 2408:8752:300:6:0:1:2:11
cdn-source: baishan
cdn-user-ip: 2408:84ef:10:b3d0:c176:8470:c886:c23
content-encoding: gzip
content-type: text/html; charset=GBK
date: Mon, 16 Mar 2020 13:59:17 GMT
expires: Mon, 16 Mar 2020 14:00:16 GMT
server: nginx
status: 200
vary: Accept-Encoding
x-cache-remote: HIT
x-content-from: netease
x-ser: BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4
'''

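# Match every "name: value" line; the lazy first group stops at the first ": "
m = re.findall(r'(.*?): (.*)', header)
# Format each pair as a quoted dict entry and join the entries with newlines
# (join avoids comparing each item with m[-1], which misfires on duplicate lines)
headers = '\n'.join(f'"{name}": "{value}",' for name, value in m)
# Print the result
print(headers)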

Output

"age": "21",
"cache-control": "no-cache,no-store,private",
"cdn-ip": "2408:8752:300:6:0:1:2:11",
"cdn-source": "baishan",
"cdn-user-ip": "2408:84ef:10:b3d0:c176:8470:c886:c23",
"content-encoding": "gzip",
"content-type": "text/html; charset=GBK",
"date": "Mon, 16 Mar 2020 13:59:17 GMT",
"expires": "Mon, 16 Mar 2020 14:00:16 GMT",
"server": "nginx",
"status": "200",
"vary": "Accept-Encoding",
"x-cache-remote": "HIT",
"x-content-from": "netease",
"x-ser": "BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC52_dx-lt-yd-shandong-jinan-5-cache-6, BC75_lt-hunan-yueyang-2-cache-4",